supports collective training with programs #18392
gavin1332 merged 5 commits into PaddlePaddle:develop
Conversation
a2a7b5d to dbe90e1
test=develop
guru4elephant
left a comment
Please take care of the distributed arguments in ParamAttr
python/paddle/fluid/param_attr.py
Outdated
  gradient_clip=None,
- do_model_average=False):
+ do_model_average=False,
+ distributed=False):
There was a problem hiding this comment.
Do you have an example for distributed=True?
I wonder whether it is possible to infer the value of distributed from the op type.
A solution has been found, so we will remove this attribute from ParamAttr. And as it belongs to the Distributed FC domain, we will handle this parameter in the next PR.
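To make the discussion concrete, here is a minimal stand-in for the ParamAttr change shown in the diff above: a `distributed` flag marking a parameter as sharded across trainers. This mirrors the proposed constructor signature only; it is not Paddle's released API, and the class body here is a plain-Python sketch.

```python
# Plain-Python sketch of the proposed ParamAttr signature from the diff.
# Not Paddle's actual implementation; the `distributed` flag was later dropped.
class ParamAttr:
    def __init__(self, name=None, gradient_clip=None,
                 do_model_average=False, distributed=False):
        self.name = name
        self.gradient_clip = gradient_clip
        self.do_model_average = do_model_average
        # Marks the parameter as split across trainers (model parallel).
        self.distributed = distributed

attr = ParamAttr(name="dist_fc.w", distributed=True)
```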
- def __init__(self):
+ def __init__(self, nrings=2):
      Collective.__init__(self)
There was a problem hiding this comment.
This is the minimum number of parallel communication rings/streams. Since parallelizing collective communication does no harm in GradAllReduce mode, we prefer this value as the default over 1, which means no parallelism at all.
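A minimal sketch of how gradient tensors might be spread over `nrings` parallel communication rings, assuming a simple round-robin policy. The function name, the gradient names, and the assignment policy are all illustrative, not Paddle's actual implementation.

```python
# Hypothetical round-robin assignment of gradient tensors to nrings
# communication rings, so allreduces can overlap on separate streams.
def assign_ring_ids(grad_names, nrings=2):
    """Map each gradient tensor name to a ring id in [0, nrings)."""
    return {name: i % nrings for i, name in enumerate(grad_names)}

rings = assign_ring_ids(["fc_0.w@GRAD", "fc_0.b@GRAD", "fc_1.w@GRAD"], nrings=2)
# consecutive gradients alternate between ring 0 and ring 1
```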
guru4elephant
left a comment
Also, please remove shard_index_op in this PR.
test=develop
test=develop
As shard_index_op also belongs to the Distributed FC domain, we have removed it.
test=develop
static_cast<int>(OpRole::kCollective) | static_cast<int>(OpRole::kBackward),
static_cast<int>(OpRole::kCollective) | static_cast<int>(OpRole::kOptimize),
Op roles will increase the complexity of Graph and Program analysis. I do not recommend adding a new op role.
I agree with you, and I will try to remove the newly added op roles. However, as collective ops exchange data among trainers, their behavior differs more or less from backward and optimize ops; I will create a new PR to discuss this topic if necessary. Thanks.
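For context, the diff above combines roles with bitwise OR, so one op can carry several roles at once. A minimal sketch of that flag pattern, with illustrative numeric values (not Paddle's actual enum values):

```python
# Hypothetical op-role bit flags, mirroring the
# OpRole::kCollective | OpRole::kBackward pattern in the diff.
K_FORWARD, K_BACKWARD, K_OPTIMIZE, K_COLLECTIVE = 1, 2, 4, 8

def has_role(op_role, role):
    """Check whether a combined op_role bitmask contains `role`."""
    return (op_role & role) != 0

# An op tagged as both collective and backward.
collective_backward = K_COLLECTIVE | K_BACKWARD
```

Every new combined role is one more case that passes like memory optimization or fusion must recognize, which is the analysis-complexity concern raised above.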
If the operator in allreduce is sum, does that mean allreduce is used for gradient aggregation? What is the logic here?
Why? This is an unreasonable assumption.
We are trying to introduce a model-parallel strategy to train extremely large classification problems in face recognition, which can have up to 10 million classes, so the size of the last FC parameter exceeds GPU memory. Therefore we have to split the parameter across multiple cards and call collective ops in the forward phase, in addition to the gradient aggregation.
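The sharding arithmetic behind this can be sketched as follows, assuming an even contiguous split of the class dimension across ranks (the helper name and split policy are illustrative, not the actual Distributed FC implementation):

```python
# Hypothetical contiguous sharding of the last FC weight's class dimension.
def shard_range(num_classes, num_ranks, rank):
    """Return the [start, end) class range owned by `rank`."""
    per_rank = (num_classes + num_ranks - 1) // num_ranks  # ceil division
    start = rank * per_rank
    end = min(start + per_rank, num_classes)
    return start, end

# 10 million classes split over 8 cards: 1.25M output columns per card.
first = shard_range(10_000_000, 8, 0)
last = shard_range(10_000_000, 8, 7)
```

Each card then computes logits only for its own class range in the forward pass, which is why collectives are needed before the softmax, not just for gradients.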
Recent research has produced algorithms to accelerate deep learning training, notably LocalSGD, which allreduces and averages the parameters in the optimization phase instead of allreducing the gradients in the backward phase; our assumption is based on this.
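The LocalSGD pattern mentioned above can be sketched as: each worker takes several local SGD steps, then parameters (not gradients) are allreduce-averaged. This is a single-process toy with a stand-in `allreduce_mean`, not a real distributed implementation.

```python
# Toy LocalSGD round: local SGD steps, then parameter averaging.
def local_sgd_round(params, grads_fn, lr, local_steps, allreduce_mean):
    for _ in range(local_steps):
        grads = grads_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    # Synchronize by averaging parameters across workers, not gradients.
    return allreduce_mean(params)

# Single-process stand-in: averaging over one worker is the identity.
result = local_sgd_round([1.0], lambda ps: [2.0 * p for p in ps],
                         lr=0.1, local_steps=2,
                         allreduce_mean=lambda ps: ps)
# 1.0 -> 1.0 - 0.1*2.0 = 0.8 -> 0.8 - 0.1*1.6 = 0.64
```

Because the collective here sums/averages parameters during optimization, an allreduce with a sum operator cannot be assumed to be gradient aggregation.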
guru4elephant
left a comment
LGTM. We should add a backward unit test for the collective ops if possible.
…nalysis test=develop
test=develop